The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Recent advances in generative adversarial networks (GANs) have demonstrated the capabilities of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works successfully extending GANs for generating 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends the recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics to the generation process, we develop a motion generator by stacking multiple motion layers to generate motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling both temporal and multi-view consistent video generation. Moreover, PV3D introduces two discriminators for regularizing the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These elaborated designs enable PV3D to generate 3D-aware motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing. Code and models will be released at https://showlab.github.io/pv3d.
translated by 谷歌翻译
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
translated by 谷歌翻译
尽管变形金刚及其变体构象体在语音识别方面表现出了有希望的表现,但参数化的属性在训练和推理过程中导致了很大的记忆成本。一些作品使用跨层重量分享来减少模型的参数。但是,不可避免的能力损失会损害模型性能。为了解决这个问题,本文提出了通过共享稀疏门控专家的参数效率构象异构体。具体而言,我们使用稀疏门控的专家(MOE)来扩展构型块的容量而不增加计算。然后,共享分组构象块的参数,以减少参数的数量。接下来,为了确保具有不同级别适应表示的灵活性的共享块,我们会单独设计MOE路由器和标准化。此外,我们使用知识蒸馏来进一步提高性能。实验结果表明,与全参数模型相比,所提出的模型用编码器的1/3来实现竞争性能。
translated by 谷歌翻译
整合多个在线社交网络(OSN)对许多下游社交挖掘任务(例如用户偏好建模,建议和链接预测)具有重要意义。但是,不幸的是,伴随着越来越多的隐私问题,泄漏敏感用户信息。如何完全利用来自不同在线社交网络的数据,同时保存用户隐私仍然无法解决。为此,我们提出了一个跨网络的社交用户嵌入框架,即DP-Crosue,以一种隐私性的方式学习用户的全面表示。我们共同考虑具有不同隐私保证的部分调整社交网络的信息。特别是,对于每个异质社交网络,我们首先引入一个混合差异隐私概念,以捕获异构数据类型的隐私期望的变化。接下来,为了找到跨社交网络的用户链接,我们进行了无监督的基于用户嵌入的对齐方式,其中通过异质网络嵌入技术实现了用户嵌入。为了进一步增强用户嵌入,一种新颖的跨网络GCN嵌入模型旨在通过那些对齐用户跨网络传输知识。在三个现实世界数据集上进行的广泛实验表明,我们的方法对用户兴趣预测任务以及捍卫用户属性推理攻击的嵌入进行了重大改进。
translated by 谷歌翻译
由于其在多语言翻译,自动驾驶等方面的广泛应用,因此场景文本识别引起了近年来的兴趣。在本报告中,我们描述了我们对词汇表场上的解决方案的解决方案,该解决方案是词汇表场上的文本理解(OOV-ST)挑战,旨在从自然场景图像中提取胶卷外(OOV)单词。我们基于OCLIP的模型在H-Mean中获得28.59 \%,在ECCV2022 TIE Workshop中对OOV挑战的端到端OOV单词识别曲目排名第一。
translated by 谷歌翻译
该报告介绍了我们对ECCV 2022挑战的获胜者解决方案,挑战了播放视频的文本理解(OOV-ST):裁剪单词识别。这项挑战是在所有内容(TIE)中的ECCV 2022讲习班的背景下举行的,该研讨会(TIE)旨在从自然场景图像中提取出播出的单词。在竞争中,我们首先在合成数据集上进行预训练,然后在训练集中对模型进行数据增强进行微调。同时,针对长期和垂直文本进行了专门训练的另外两个型号。最后,我们将不同模型的输出与不同的层,不同的骨干和不同种子结合在一起。当考虑使用唱歌内和播放量的单词时,我们的解决方案的总体单词准确性为69.73%。
translated by 谷歌翻译
迭代最接近的点(ICP)算法是三维表面配准的几何比喻对齐的最重要算法之一,该算法经常用于计算机视觉任务,包括同时定位和映射(SLAM)任务。在本文中,我们说明了ICP算法的理论原理,如何在表面注册任务中使用以及ICP算法变体的传统分类学。随着SLAM成为一个受欢迎的话题,我们还根据每种SLAM任务的特征,包括SLAM任务是否在线,以及地标是否作为特征作为特征作为功能,我们还介绍了ICP算法的大满贯分类。大满贯任务。我们通过比较几个最新的研究论文并分析其实施细节来综合每种SLAM任务。
translated by 谷歌翻译
视频的行动识别和姿势估计与了解人类动作密切相关,但更多的文献集中于如何仅解决姿势估计任务,从行动识别中。这项研究显示了基于动作识别的Videopose3D的更快,更灵活的培训方法。该模型具有与将要估计的类型相同的动作类型,并且可以单独训练不同类型的动作。有证据表明,对于常见的姿势估计任务,该模型需要相对较少的数据来对原始研究进行相似的结果,并且对于以动作为导向的任务,它在有限的接受程度上优于4.5%的原始研究MPJPE速度误差的现场大小和训练时期。该模型可以处理以动作为导向和常见的姿势估计问题。
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译